Thinking Hallucination for Video Captioning
Authors
Abstract
With the advent of rich visual representations and pre-trained language models, video captioning has seen continuous improvement over time. Despite this performance improvement, video captioning models are prone to hallucination. Hallucination refers to the generation of highly pathological descriptions that are detached from the source material. In video captioning, there are two kinds of hallucination: object and action hallucination. Instead of endeavoring to learn better representations of a video, in this work we investigate the fundamental sources of the hallucination problem. We identify three main factors: (i) inadequate visual features extracted from the video, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy. To alleviate these problems, we propose robust solutions: (a) the introduction of auxiliary heads trained in multi-label settings on top of the extracted features and (b) the addition of context gates, which dynamically select features during fusion. The standard evaluation metrics for video captioning measure similarity with ground-truth captions and do not adequately capture object and action relevance. To this end, we propose a new metric, COAHA (caption object and action hallucination assessment), which assesses the degree of hallucination. Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets, especially by a massive margin in CIDEr score.
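The context gate described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation; the shapes, weight names (`W_v`, `W_t`, `b`), and the scalar-per-dimension gating form are assumptions chosen to show the general idea: a learned sigmoid gate decides, per feature dimension, how much of the visual context versus the target (textual) context enters the fused representation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gate(visual, textual, W_v, W_t, b):
    """Per-dimension gate in (0, 1) blending two modality contexts.

    The output is an element-wise convex combination of the inputs,
    so each fused value lies between the two modality values.
    """
    g = sigmoid(visual @ W_v + textual @ W_t + b)
    return g * visual + (1.0 - g) * textual

# Hypothetical 8-dimensional context vectors and small random weights.
d = 8
visual = rng.standard_normal(d)
textual = rng.standard_normal(d)
W_v = 0.1 * rng.standard_normal((d, d))
W_t = 0.1 * rng.standard_normal((d, d))
b = np.zeros(d)

fused = context_gate(visual, textual, W_v, W_t, b)
print(fused.shape)  # (8,)
```

In a trained model the gate parameters would be learned jointly with the captioning objective, letting the decoder down-weight the language prior when the visual evidence should dominate.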
Similar Resources
Reconstruction Network for Video Captioning
In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (...
Consensus-based Sequence Training for Video Captioning
Captioning models are typically trained using the cross-entropy loss. However, their performance is evaluated on other metrics designed to better correlate with human assessments. Recently, it has been shown that reinforcement learning (RL) can directly optimize these metrics in tasks such as captioning. However, this is computationally costly and requires specifying a baseline reward at each st...
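The RL setup sketched in that abstract can be illustrated on a toy problem. The following is a hedged, self-contained sketch of REINFORCE with a baseline (not the consensus-based method itself): a one-step "caption" policy over three hypothetical tokens is pushed toward the token with the highest metric reward, with the policy's expected reward used as the variance-reducing baseline.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical per-token metric scores (stand-in for e.g. a CIDEr reward).
reward = np.array([0.1, 0.9, 0.2])
logits = np.zeros(3)

for _ in range(500):
    p = softmax(logits)
    a = rng.choice(3, p=p)           # sample a "caption" (single token)
    baseline = p @ reward            # expected reward under current policy
    advantage = reward[a] - baseline # variance-reduced learning signal
    grad_logp = -p                   # gradient of log p(a) w.r.t. logits...
    grad_logp[a] += 1.0              # ...is (one_hot(a) - p)
    logits += 0.5 * advantage * grad_logp

p_final = softmax(logits)
print(p_final.argmax())  # policy concentrates on the highest-reward token
```

Subtracting the baseline leaves the gradient unbiased but shrinks its variance; the practical question the abstract raises is how to choose that baseline cheaply for sequence-level metrics.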
Deep Learning for Video Classification and Captioning
Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today's big data. In this paper, we focus on reviewing two lines of research aiming to stimulate the comprehension of videos with deep learning: video classification and video captioning. While video classification con...
Multimodal Memory Modelling for Video Captioning
Video captioning which automatically translates video clips into natural language sentences is a very important task in computer vision. By virtue of recent deep learning technologies, e.g., convolutional neural networks (CNNs) and recurrent neural networks (RNNs), video captioning has made great progress. However, learning an effective mapping from visual sequence space to language space is st...
Grounded Objects and Interactions for Video Captioning
We address the problem of video captioning by grounding language generation on object interactions in the video. Existing work mostly focuses on overall scene understanding with often limited or no emphasis on object interactions to address the problem of video understanding. In this paper, we propose SINet-Caption that learns to generate captions grounded over higher-order interactions between...
Journal
Journal title: Lecture Notes in Computer Science
Year: 2023
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-031-26316-3_37